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Abstract 

Structural equation models and Bayesian networks have been widely 
used to study causal relationships between continuous variables. Recently, 
a non-Gaussian method called LiNGAM was proposed to discover such 
causal models and has been extended in various directions. An important 
problem with LiNGAM is that the results are affected by the random 
sampling of the data as with any statistical method. Thus, some analysis 
of the confidence levels should be conducted. A common method to eval- 
uate a confidence level is a bootstrap method. However, a confidence level 
computed by ordinary bootstrap is known to be biased as a probability- 
value (p-value) of hypothesis testing. In this paper, we propose a new 
procedure to apply an advanced bootstrap method called multiscale boot- 
strap to compute p- values of LiNGAM outputs. The multiscale bootstrap 
method gives unbiased p-values with asymptotic much higher accuracy. 
Experiments on artificial data demonstrate the utility of our approach. 



1 Introduction 

Structural equation models [1] and Bayesian networks [2, 3] have been widely 
applied to analyze causal relationships in many fields. Many methods [2, 3] 
have been developed to discover such a causal model when no prior knowledge 
on the network structure is available. Recently, a non-Gaussian method called 
LiNGAM [4] was proposed. The new method estimates a causal ordering of 
variables using passive observational data alone. The estimated ordering is 
correct if the causal relations form a linear structural equation model with non- 
Gaussian external influence variables and the sample size is infinitely large. 
In practice, however, the sample size is finite. The finite sample size induces 
statistical errors in the estimation, and the estimated ordering may not be right 
even when the model assumptions are reasonable. Thus, some analysis of the 
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statistical reliability or confidence level of the estimated ordering should be 
done. In this paper, we discuss such reliability analysis of LiNGAM. 

A common procedure to evaluate such a confidence level is statistical hypoth- 
esis testing [5] . In statistical testing, one computes a probability- value (p- value) 
of a hypothesis. The hypothesis is rejected when the p-value is not greater than 
a pre-specified level of significance, say 5%. There are several approaches to 
define a p-value. Bootstrapping [6] is a well-known computational method for 
computing confidence levels when a simple mathematical formula is difficult to 
derive. It is a resampling method to approximate a random sample by a boot- 
strap sample that is created by random sampling with replacement from the 
original single dataset. Felsenstein [7] proposed to use bootstrapping to define 
a p- value in the context of phylogenetic tree selection of molecular evolution in 
bioinformatics. He defined a p-value of a tree by a frequency called bootstrap 
probability that the tree is found to be optimal when tree selection is performed 
for a number of bootstrap replicates of the original dataset. The idea has been 
applied to other multivariate analyses including Bayesian networks [8]. 

However, it is known that the bootstrap probability is biased as a p- value [9, 
10]. The naive bootstrapping tends to give overconfidence in wrong hypotheses. 
Thus, some advanced bootstrap methods to achieve higher accuracy have been 
proposed [9,11-13]. Among others, multiscale bootstrapping [12,13] is much 
more accurate but still easy to implement and has been successful in the field 
of phylogenetic tree selection. 

In this paper, we propose to apply the multiscale bootstrap to compute con- 
fidence levels, i.e., p- values, of variable orderings estimated by LiNGAM. The 
paper is structured as follows. First, in Section [2j we briefly review LiNGAM 
and multiscale bootstrap. In Section[3]we propose a new procedure to compute 
p-values of LiNGAM outputs using the multiscale bootstrap method. The mul- 
tiscale bootstrap method is tested using artificial data in Section|4l Conclusions 
are given in Section [5] 

2 Background 
2.1 LiNGAM 

In [4], a non-Gaussian variant of structural equation models and Bayesian net- 
works, which is called LiNGAM, was proposed. Assume that observed data are 
generated from a process represented graphically by a directed acyclic graph, 
i.e., DAG. Let us represent this DAG by a mxm adjacency matrix B={6ij} 
where every bij represents the connection strength from a variable Xj to an- 
other Xi in the DAG, i.e., the direct causal effect of Xj on Xi. Let us further 
define A = (I — B)~^. The (j, i)-element aji represents the tote/ causal effect of 
Xi on Xj [14]. Moreover, let us denote by k{i) a causal order of variables Xi in 
the DAG so that no later variable influences any earlier variable. For example, a 
variable Xj is not causally influenced by a variable Xi, i.e., aji=0, if fc(j) < k{i). 
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Moreover, assume that the relations between variables are linear. Then we have 



6i j Xj + e. 



(1) 



where is an external influence variable. All external influences arc con- 
tinuous random variables having non- Gaussian distributions with zero means 
and non-zero variances, and e,; are independent of each other so that there is no 
unobserved confounding variables [3]. We emphasize that k{j)<k{i) does not 
necessarily imply that Xj influences Xi. It only implies that aji=0, and can 
be either zero or non-zero. The causal ordering k{i) only defines a partial order 
of variables, which is enough to define a DAG. In [4], the LiNGAM model ([T]) 
was shown to be identifiable without using any prior knowledge on the network 
structure. That is, the variable orders k{i) and connection strengths 6.^ are 
estimable solely based on the data matrix of x = [xi, ■ ■ ■ , Xm]"^ . In [4], a dis- 
covery algorithm based on independent component analysis (ICA) [15], which 
is called LiNGAM algorithm, was also proposed to estimate k{i) and bij. 

2.2 Bootstrap probability 

Denote by x a m-dimensional random variable vector and by X=(xi, • • • , x„) a 
random sample of size n from the distribution of x. Further, define a function 
/(X) so that /(X)=0 if a hypothesis is rejected and otherwise /(X)=l. Suppose 
that we obtain a mxn data matrix X that is generated from x, and the function 
/(X)=l. Then, it is useful to evaluate how statistically reliable the value of 
/(X)=l is since the function could return for another data matrix due to 
sample fluctuation. In [7], Felesenstein proposed to use bootstrapping [6] to 
evaluate such reliability. Let us denote by X* a q-th bootstrap sample of size 
n* , which is created by random sampling with replacement from the columns of 
X. In ordinary bootstrap, n* is taken to be n. Then, the bootstrap probability 
p^^ is defined as a frequency that /(X*)=l: 



where Q is the number of bootstrap replications. A testing procedure was 
proposed that the hypothesis is rejected if p^^ is not greater than a significance 
level a (0<q;<1), say 0.05. However, it is known that p^^ is biased as a p- 
value [9,10]. The multiscale bootstrap [12,13] corrects the bias and gives a 
more accurate p-valuc. This is explained in more detail in the next subsection. 

2.3 Unbiasedness 

To discuss the bias of a p-value, it is conventional [9] to assume that there 
exists a function g that transforms a random sample X to a i^T-dimensional 
random vector y^[yi, • • • , Uk]^ that (at least approximately) follows a Gaussian 



P 
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distribution with an unknown mean vector fi and covariance identity I, i.e., 
NxifJ-, I). Note that it is not necessary to specify the actual functional form of 
g and dimension K. Let us denote by H such a class of y that /(X)=l. Then, 
the null hypothesis /(X)=l can be described as fiGH in terms of a region in 
the parameter space. We only have to consider y to discuss the bias of a p- 
value computed based on X due to the transformation-respecting property of 
bootstrapping [6]. 

In statistical hypothesis testing, the null hypothesis fiGH is rejected when 
a p- value computed based on y, which is denoted by p{y), is not greater than 
a significance level a. A test controls a type-I error if the probability of false 
rejection under the null hypothesis is not greater than a. This is a desirable 
property of a testing procedure. Another desirable property is unbiasedness 
[5]. A test is unbiased if the probability of correct rejection under alternative 
hypotheses is not less than a, and the type-I error is also controlled. Then an 
unbiased test is formally defined to be a test that uses a p- value p{y) satisfying 

Prob{p(y) < a} < a, /x G H and Prob{p(y) < a} > a, fi i^V.. (3) 

Let us denote by dTi the boundary of H. To satisfy the inequalities above, the 
following equation needs to hold [5]: 

Prob{p(y) < a} = a, e dU. (4) 

In other words, p{y) follows a uniform distribution over the interval [0, 1]. It 
has been shown [12] that p^^ has a rather large bias to meet the unbiasedness 
condition ([1]): 

Prob{p^^(y) <a}=a + Oin-^/^), (5) 

where O(-) is the Landau symbol. Multiscalc bootstrap [12] reduces the bias. 
Let p^'^^ denote a p-value computed by multiscale bootstrap. It can be shown 
that p^^^ is approximately unbiased with asymptotic third-order accuracy: 

Prob{p^^-^(y) <a}^a + 0{n-^/^), (6) 

Thus, multiscale bootstrap gives a p- value with much higher-order accuracy 
than ordinary bootstrap. Rigorously speaking, the boundary dJi needs to be 
assumed to be smooth or approximately smooth. Otherwise, no unbiased test 
can be defined [5]. However, it has been shown that 

pMB 

less biased than 

pBP gvgj^ jf ^Ijq boundary is non-smooth [13]. 

2.4 Multiscale Bootstrap 

In [13], the theory of multiscale bootstrap [12] was extended, and a class of 
unbiased p- values including 

pMB 

^ was obtained. Let y* denote the y vec- 
tor corresponding to X* = [x*, • • • ,x*.]. Then the standard deviation of y* is 
proportional to l/\/n*, and its value relative to the case n*=n is called 'scale' 
of bootstrap resampling; this is defined by a=y^n/n* . Then the bootstrap 
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probability p^^ in ([5]) is a function of ct^, which is denoted by p^i" for clarity. 
The fundamental idea of the extended multiscale bootstrap [13] is to compute 
the bootstrap probability p^i' with the scale a^=—l, i.e., the bootstrap sam- 
ple size n*=~-n. Of course, it is impossible to set n*=—n. Therefore, one 
first select several scales cr>0, computes the bootstrap probability for each of 
the corresponding bootstrap sample sizes n/tr^ and extrapolates the bootstrap 
probabilities to (T^=— 1, i.e., n*~—n. 

We now review a procedure to compute such unbiased p- values. Let us define 
a bootstrap 2:-value by 

z^. = -$-1 ^pBP^ , (7) 

where is the inverse of the distribution function $ of the standard Gaussian 
distribution A'^(0,1). Further, let us call <tZc2 a normalized bootstrap z-value. 
Then, consider to model the changes in <tz„i along the changing the scale a by 
a model ?/'(i7^|/3), where /3=[/3o, • • • , Ph-i^ is a parameter vector of the model. 
Two model classes are proposed in [13]: 

^PUa'lO) = /3o + E 1 ^ R f -T, 0</3;,_i<l, /i>3. (9) 

^ l + /3h-l(cr-l) 

The model ([8]) is reasonable when the boundary dH is smooth, and the model 
© is preferable when "H is a cone and dH is not smooth. To estimate the 
models, a number of sets of bootstrap replicates with different scales dd {d=l, 
• • • , D) arc first created, and subsequently the bootstrap probability p^i^ for 
each scale is computed. Note that the bootstrap sample sizes may be different 
from that of the original dataset. Then, a set of scales and normalized bootstrap 
2- values {<Td, a^z 2} is obtained. Note that z 2 is computed based onp^i' using 
^ . Finally, the model parameter vector /3 arc estimated using the set of scales 
and normalized bootstrap z-values. The maximum likelihood method is applied 
since Qp^F follows a binomial distribution. A best model il^best['^'^\P) is selected 
using an information criterion AIC [16]. 

Then, a class of p- values using the best model V'&est(o'^|/3) is derived: 

Ph = j-i. — d{a^ (10) 

where o-g is taken to be unity. The right side of ([TU]) is the first h terms of 
the Taylor series of the slope oi z„2 &i 1/ a = 1, i.e., dz'^/ d[l/ around ctq. 
It can be shown that p^^ is actually equal to p^^^ in ([5]) that achieves the 
unbiasedness with asymptotic third-order accuracy. Further, pf^^ turns out to 
be the naive bootstrap probability p^^ in ([2]). The larger h gives an unbiased 
p- value with asymptotic higher-order accuracy [13]. However, it also makes the 
maximum likelihood estimation less stable. In practice, h—2 or 3 is often used. 
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Figure 1: Four regious i?^, H^2^ ^211 ^'^'^ ^21- 

3 A multiscale bootstrap procedure to assessing 
reliability of LiNGAM 

We first define null hypotheses tested. We here focus on the following four types 
of hypotheses between Xi and xj (i^j), although we can test hypotheses that 
describe the relations between more than two variables similarly: 

1. H^: a hypothesis that Xi is directly caused by Xj, and its connection 
strength is positive, i.e., 6y >0; 

2. H^: a hypothesis that Xi is directly caused by xj, and its connection 
strength is negative, i.e., bij<0; 

3. H^^: a hypothesis that Xj is directly caused by Xi, and its connection 
strength is positive, i.e., bji>0; 

4. HJl: a hypothesis that Xj is directly caused by Xi, and its connection 
strength is negative, i.e., bji<0. 

See Fig. [T] for the four regions of the parameter space in two variable cases 
around the origin that the connection strengths 612 and 621 are zeros. LiNGAM 
outputs fc(2)>/c(l) if |6i2|< 1 621 1 7 and otherwise fc(l)>fc(2) since each total effect 
Oij is equal to the corresponding direct effect 6^ in two variable cases [4]. We 
note that the signs of connections strengths are important and interesting in 
many applications [1, 17] as well as the variable orderings. Further, this way of 
dividing the space based on the signs and orderings would make the boundaries 
of the regions be closer to be smooth than solely based on the orderings and 
help the multiscale bootstrap work better. 

We now propose a new procedure to apply Multiscale Bootstrap to LiNGAM, 
which we call MB-LiNGAM: 



MB-LiNGAM procedure 
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1. Select the scales a\, au (D>2) so that n*i='n/<J^ is an integer and 
choose the number of bootstrap replicates Q. 

2. Generate Q bootstrap replicates X* ^ (<z=l, • • • , Q) for each scale ad, i.e., 
each bootstrap sample size n'^^n/a'^. 

3. Perform LiNGAM algorithm to each bootstrap replicate X* ^ and compute 
the bootstrap probabilities and Pd^{H~^) {i^j) for each scale ad, 
where p^^{H) denotes the bootstrap probability of a hypothesis H for scale 

4. Compute the multiscale bootstrap p-values and Ph'^{H~.) {i^j) 
using the procedure in Section [Z4l more specifically pop . where pff^{H) 
denotes the multiscale bootstrap p-value of H with the order h. 

In the simulations below, the ICA part of LiNGAM algorithm is run several 
times in Step [31 Each time the initial point of the optimization is randomly 
changed. The set of the estimates that achieves the largest value of an ICA 
objective function is used in the subsequent steps. It is a common practice to 
alleviate the effects of possible local maxima. 

Related work Some methods have been proposed to test significance of direct 
effects bij [4, 14]. For simplicity, let us consider two variable cases, where each 
direct effect bij is equal to the corresponding total effect aij as mentioned above. 
Those methods test if each of effects bij is zero or not and imply that k{i)<k{j) 
if '6y=0' is accepted and that k{j)<k{i) if '6ji=0' is accepted. Such a procedure 
would work if bij or bjt is exactly zero. However, in reality, the assumptions 
of the model ^ are more or less violated, and hence both of bij and bji could 
be non-zero. In such cases, those existing methods might reject both of the 
hypotheses and not give much information on which ordering is better. Even 
in the cases, our approach tells which ordering is better or statistically more 
reliable comparing bootstrap probabilities of the orderings. 



4 Simulations 

We first created three LiNGAM models with m=2 variables: 
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(11) 



where 6=0, 0.01 or 0.1, and ei and 62 followed a Laplace distribution with mean 
zero and variance two. The model with b=0 is on the boundary of H^2^ ^121 
H+, and H^^. The model with 6=0.01 or 0.1 is on the boundary between ffj^ 
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and H2i- Further, we created two LiNGAM models with m=6 variables: 
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(12) 



where 6=0 or 0.5, and ei and 62 also followed a Laplace distribution whose mean 
zero and variance two. We randomly generated 1280 datasets with sample size 
1000 under each of the five models. Then we applied MB-LiNGAM procedure in 
Section |3] to the datasets. The scales ad were selected so that they gave integer 
values of bootstrap sample size and were (approximately) equally-spaced in log- 
scale between 1/9 and 9 {d=l, • • • , 13). The number of bootstrap replicates Q 
was 1000, and the value h for pf^^^ was 3. 

The histograms of p- values of i?^ computed by ordinary bootstrap and those 
by multiscale bootstrap in the two variable cases are shown in Fig. [2] Similar 
histograms were obtained for the other conditions. Each of the histograms of 
p-values computed by multiscale bootstrap looked closer to the uniform dis- 
tribution than by ordinary bootstrap. This implied that multiscale bootstrap 
provided better unbiased p-values. 

In Fig. 3, we also show a scatterplot of empirical rejection probabilities 
by ordinary bootstrap Prob{j)^^(if32)<Q!} and those by multiscale bootstrap 
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BP, b = 0.01 



BP. b = 0.1 
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MB, 6 = 



MB, b = 0.01 



MB, b = 0.1 
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Figure 2: Top row: Histograms of p-values of i/ji by ordinary bootstrap (BP). 
Bottom row: Histograms of p- values of by multiscale bootstrap (MB). The 
uniform density functions are given by the red lines. 
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Figure 3: Scatterplots of empirical rejection probabilities of _ff^ by ordinary 
bootstrap and those by multiscale bootstrap versus significance levels in the six 
variable cases. 



Prob{p3^-^(iJ32)<Q!} versus significance levels a in the six variable cases. Plots 
for unbiased tests should be on the diagonal. That is, their rejection prob- 
abilities should be equal to the corresponding significance levels. Most of the 
plots for ordinary bootstrap are far above the diagonal, indicating that ordinary 
bootstrap gave rather biased p-values and tended to reject reasonable hypothe- 
ses much more often than the nominal frequencies or significance levels. In 
contrast, the plots for multiscale bootstrap are much closer to the diagonal. 
This showed that multiscale bootstrap provided much better unbiased p- values. 

5 Conclusion 

We proposed a new procedure to evaluate statistical reliability of LiNGAM. Our 
procedure gives p-values of variable orderings estimated by LiNGAM and tells 
which orderings are more reliable. The utility of our procedure was demon- 
strated in the simulations. Future work would investigate how sensitive to 
non-smoothness of the boundaries of hypothesis regions our method is and how 
it is alleviated, although the simulations implied that it might be not very prob- 
lematic. 
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